Consensi check

In [1]:
# Import the helper functions used in this notebook
from utils import Stats, cons_parser, length_hist, parse_repbase, reads_hist

Consensi length distribution

In [2]:
data = parse_repbase("data/RepBase_subsetv2.multifasta")
length_hist(data, "RepBase")
Stats(data)
Out[2]:

Summary

Number of consensi: 17407
Longest sequence (bp): 45000
Shortest sequence (bp): 40
Average length (bp): 2998.0
Median length (bp): 2525.0
LTR: 7735
DNA: 5621
LINE: 2347
non_LTR: 1224
UNKNOWN: 177
SINE: 144
SAT: 121
MSAT: 29
pseudogene: 7
simple: 2
In [3]:
seq_dict = cons_parser("data/consensi/RM2_consensi.fa.classified")
length_hist(seq_dict, "RepeatModeler2")
Stats(seq_dict)
Out[3]:

Summary

Number of consensi: 15384
Longest sequence (bp): 14783
Shortest sequence (bp): 29
Average length (bp): 2262.3
Median length (bp): 773.0
LTR: 12565
Unknown: 2069
LINE: 513
DNA: 147
RC: 34
SINE: 22
tRNA: 14
rRNA: 8
Simple_repeat: 7
Satellite: 5
In [4]:
seq_dict = cons_parser("data/consensi/EDTA_consensi.fa")
length_hist(seq_dict, "EDTA")
Stats(seq_dict)
Out[4]:

Summary

Number of consensi: 16191
Longest sequence (bp): 16685
Shortest sequence (bp): 80
Average length (bp): 1971.9
Median length (bp): 1109.0
LTR: 10166
DNA: 4725
MITE: 1300
In [5]:
seq_dict = cons_parser("data/consensi/MITE_consensi.fa")
length_hist(seq_dict, "MITE-Tracker")
Stats(seq_dict)
Out[5]:

Summary

Number of consensi: 10863
Longest sequence (bp): 800
Shortest sequence (bp): 49
Average length (bp): 289.4
Median length (bp): 235.0
MITE: 10863
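
The summaries above report the number of consensi, the longest, shortest, mean and median lengths, and a count per repeat class. The `utils` module is not shown here, so the following is only a minimal sketch of how such a summary could be computed. The function name `summarize_consensi` is hypothetical, and the sketch assumes a dict mapping FASTA headers to sequences with headers of the form `name#Class/Family`; the actual `Stats` helper may work differently.

```python
from collections import Counter
from statistics import median


def summarize_consensi(seqs):
    """Return length statistics and per-class counts for a dict of sequences.

    `seqs` maps a FASTA header (without '>') to its sequence string.
    Assumption: headers look like 'name#Class/Family'; a header without
    '#' is counted as 'Unknown'.
    """
    lengths = [len(seq) for seq in seqs.values()]
    classes = Counter()
    for header in seqs:
        first_token = header.split()[0]
        if "#" in first_token:
            # Keep only the top-level class, e.g. 'LTR' from 'LTR/Gypsy'.
            annotation = first_token.split("#", 1)[1]
            classes[annotation.split("/", 1)[0]] += 1
        else:
            classes["Unknown"] += 1
    return {
        "n": len(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": round(sum(lengths) / len(lengths), 1),
        "median": median(lengths),
        "classes": dict(classes),
    }


# Toy input, not taken from the data files used in this notebook.
example = {
    "fam1#LTR/Gypsy": "ACGT" * 10,  # 40 bp
    "fam2#DNA/hAT": "ACGT" * 5,     # 20 bp
    "fam3": "ACGTAC",               # 6 bp, no class annotation
}
summary = summarize_consensi(example)
```

Collapsing `Class/Family` to its top-level class matches the granularity of the tables above (LTR, DNA, LINE, ...); a finer breakdown would simply keep the full annotation as the key.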

Repeated consensi

For EDTA and RepeatModeler2:

  • Number of reads: 60000

For MITE-Tracker:

  • Number of reads: 6000

Coverage of the consensus:

$\text{coverage of the consensus} = \frac{2 \times 150 \times \text{number of mapped reads}}{\text{consensus length}}$
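
As a worked example of the formula above, here is a small sketch. The function and parameter names are hypothetical, and reading the factor 2 as "two reads per fragment" (paired-end) with 150 bp reads is an assumption, since the notebook does not state what the constants stand for.

```python
def consensus_coverage(mapped_reads, consensus_length,
                       read_length=150, reads_per_fragment=2):
    """Coverage = reads_per_fragment * read_length * mapped_reads / length.

    The defaults (150 and 2) are the constants that appear in the formula;
    their interpretation as read length and mates per fragment is assumed.
    """
    if consensus_length <= 0:
        raise ValueError("consensus_length must be positive")
    return reads_per_fragment * read_length * mapped_reads / consensus_length


# 10 mapped reads on a 3000 bp consensus: 2 * 150 * 10 / 3000
cov_a = consensus_coverage(10, 3000)
# 25 mapped reads on a 1500 bp consensus: 2 * 150 * 25 / 1500
cov_b = consensus_coverage(25, 1500)
```

Because the result is normalised by consensus length, a short consensus reaches a given coverage with far fewer mapped reads than a long one.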
In [9]:
reads_hist("data/sam/RepBase_coverage.sam", "data/RepBase_subsetv2.multifasta", "RepBase")
In [6]:
reads_hist("data/sam/RM2_coverage.sam", "data/consensi/RM2_consensi.fa.classified", "RepeatModeler2")
In [7]:
reads_hist("data/sam/EDTA_coverage.sam", "data/consensi/EDTA_consensi.fa", "EDTA")
In [8]:
reads_hist("data/sam/MITE_coverage.sam", "data/consensi/MITE_consensi.fa", "MITE-Tracker")
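
The `reads_hist` calls above take a SAM file and the matching consensus FASTA. Its implementation is not shown, so the sketch below only illustrates one plausible first step: counting mapped alignment records per consensus from SAM text. The function name `count_mapped_reads` is hypothetical. It relies only on the standard SAM columns (FLAG in column 2, RNAME in column 3) and skips unmapped, secondary and supplementary records. Each mate of a pair is counted as one record here; whether the coverage formula counts mates or pairs is not stated in the notebook.

```python
from collections import Counter

# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_mapped_reads(sam_lines):
    """Count primary mapped alignment records per reference name."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY) or rname == "*":
            continue
        counts[rname] += 1
    return counts


# Toy SAM records, not taken from the files used in this notebook.
toy_sam = [
    "@SQ\tSN:cons1\tLN:3000",
    "r1\t99\tcons1\t1\t60\t150M\t=\t200\t349\t*\t*",
    "r1\t147\tcons1\t200\t60\t150M\t=\t1\t-349\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",               # unmapped
    "r3\t256\tcons1\t50\t0\t150M\t*\t0\t0\t*\t*",     # secondary
    "r4\t0\tcons2\t5\t60\t150M\t*\t0\t0\t*\t*",
]
counts = count_mapped_reads(toy_sam)
```

With a real file, the same function can be fed an open file handle, for example `with open(path) as handle: counts = count_mapped_reads(handle)`, and the per-consensus counts can then be combined with consensus lengths to obtain coverage.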